Benchmarks

Performance comparison of word filtering functions (Part 2)

Introduction

For one of my programs, I have to filter and delete some words from a long text from a long text. I need to remove pronouns, conjugated verbs, determiners and some other words. some other words.

In the previous tutorial, I presented and compared several tree-like data structures to speed up the sorting of these words.

In order to complete this comparison, I wanted to test a last structure by taking into account the last results

Multi-branch custom tree (WordTree4x)

In my opinion, the idea of putting several letters in a single branch as I did with the WordTree2g and WordTree3g trees is the right one. The bad performance of these trees comes from the distribution test itself. Indeed to select the right letters to put in each branch, I proceeded to a test with the elem function which slows down a lot the program.

The solution would be to replace this function with a simpler test that is less time consuming. For example a test of equality == or of order <, >, <= or >=. But if we want to do this, we will have to perform the distribution tests according to the alphabetical order of the letters. The idea of using a specific letter order will therefore be abandoned. On the other hand, the grouping of letters will be specific and adapted to the words to be sorted.

By taking again the statistics of the letters, we will have:

  1. A 77

  2. B 4

  3. C 34

  4. D 27

  5. E 133

  6. F 19

  7. G 9

  8. H 4

  9. I 85

  10. J 5

  11. L 46

  12. M 18

  13. N 59

  14. O 61

  15. P 21

  16. Q 21

  17. R 49

  18. S 103

  19. T 90

  20. U 102

  21. V 21

  22. X 9

  23. Y 4

  24. Z 8

For a total of 1009 letters.

For each version, we will place the cuts in order to minimize the difference in the number of elements in each branch.

  • If we cut at the letter M (included), we will have :

    Branch

    Nb of elems

    A-M

    461

    N-Z

    548

    The maximum deviation with a perfect distribution of the elements is |548-504.5|=43.5

  • If we cut at the letter N (included), we will have :

    Branch

    Nb of elems

    A-M

    520

    N-Z

    489

    The maximum deviation with a perfect distribution of the elements is |489-504.5|=15.5

The gap is therefore smaller with a cut at the letter N (included).

And we will apply this method to all other trees.

Implémentations

Several tests will be performed with a different number of branches in order to see the influence of this number on the performance of the tree.

We will have the following implementation:

2-branchs custom tree (WordTree42)

member v0 tree = go v0 tree
    where
        go _  Tip                 = False
        go _  t@(Fork vs Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2) | a <= 'l'  = go as t1
                                    | otherwise = go as t2

        go _ _ = False

3-branchs custom tree (WordTree43)

member v0 tree = go v0 tree
    where
        go _  Tip                     = False
        go _  t@(Fork vs Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3) | a <= 'h'  = go as t1
                                       | a <= 'r'  = go as t2
                                       | otherwise = go as t3

        go _ _ = False

4-branchs custom tree (WordTree44)

member v0 tree = go v0 tree
    where
        go _  Tip                         = False
        go _  t@(Fork vs Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4) | a <= 'e'  = go as t1
                                          | a <= 'n'  = go as t2
                                          | a <= 's'  = go as t3
                                          | otherwise = go as t4

        go _ _ = False

5-branchs custom tree (WordTree45)

member v0 tree = go v0 tree
    where
        go _  Tip                             = False
        go _  t@(Fork vs Tip Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4 t5) | a <= 'd'  = go as t1
                                             | a <= 'i'  = go as t2
                                             | a <= 'q'  = go as t3
                                             | a <= 't'  = go as t4
                                             | otherwise = go as t5

        go _ _ = False

6-branchs custom tree (WordTree46)

member v0 tree = go v0 tree
    where
        go _  Tip                                 = False
        go _  t@(Fork vs Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4 t5 t6) | a <= 'd'  = go as t1
                                                | a <= 'h'  = go as t2
                                                | a <= 'm'  = go as t3
                                                | a <= 'r'  = go as t4
                                                | a <= 't'  = go as t5
                                                | otherwise = go as t6

        go _ _ = False

7-branchs custom tree (WordTree47)

member v0 tree = go v0 tree
    where
        go _  Tip                                     = False
        go _  t@(Fork vs Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7) | a <= 'd'  = go as t1
                                                   | a <= 'g'  = go as t2
                                                   | a <= 'm'  = go as t3
                                                   | a <= 'q'  = go as t4
                                                   | a <= 's'  = go as t5
                                                   | a == 't'  = go as t6
                                                   | otherwise = go as t7

        go _ _ = False

8-branchs custom tree (WordTree48)

member v0 tree = go v0 tree
    where
        go _  Tip                                         = False
        go _  t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _   _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8) | a <= 'd'  = go as t1
                                                      | a <= 'f'  = go as t2
                                                      | a <= 'l'  = go as t3
                                                      | a <= 'o'  = go as t4
                                                      | a <= 'r'  = go as t5
                                                      | a == 's'  = go as t6
                                                      | a == 't'  = go as t7
                                                      | otherwise = go as t8

        go _ _ = False

9-branchs custom tree (WordTree49)

member v0 tree = go v0 tree
    where
        go _  Tip                                             = False
        go _  t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _   _   _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8 t9) | a <= 'c'  = go as t1
                                                         | a <= 'e'  = go as t2
                                                         | a <= 'i'  = go as t3
                                                         | a <= 'm'  = go as t4
                                                         | a <= 'o'  = go as t5
                                                         | a <= 'r'  = go as t6
                                                         | a == 's'  = go as t7
                                                         | a == 't'  = go as t8
                                                         | otherwise = go as t9

        go _ _ = False

10-branchs custom tree (WordTree410)

member v0 tree = go v0 tree
    where
        go _  Tip                                                 = False
        go _  t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _   _   _   _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8 t9 t10) | a <= 'c'  = go as t1
                                                             | a <= 'e'  = go as t2
                                                             | a <= 'i'  = go as t3
                                                             | a <= 'm'  = go as t4
                                                             | a == 'n'  = go as t5
                                                             | a <= 'q'  = go as t6
                                                             | a == 'r'  = go as t7
                                                             | a == 's'  = go as t8
                                                             | a == 't'  = go as t9
                                                             | otherwise = go as t10

        go _ _ = False

11-branchs custom tree (WordTree411)

member v0 tree = go v0 tree
    where
        go _  Tip                                                     = False
        go _  t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _   _   _   _   _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11) | a <= 'c'  = go as t1
                                                                 | a <= 'e'  = go as t2
                                                                 | a <= 'h'  = go as t3
                                                                 | a <= 'j'  = go as t4
                                                                 | a <= 'm'  = go as t5
                                                                 | a == 'n'  = go as t6
                                                                 | a <= 'q'  = go as t7
                                                                 | a == 'r'  = go as t8
                                                                 | a == 's'  = go as t9
                                                                 | a == 't'  = go as t10
                                                                 | otherwise = go as t11

        go _ _ = False

12-branchs custom tree (WordTree412)

member v0 tree = go v0 tree
    where
        go _  Tip                                                         = False
        go _  t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _   _   _   _   _   _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12) | a <= 'c'  = go as t1
                                                                     | a <= 'e'  = go as t2
                                                                     | a <= 'h'  = go as t3
                                                                     | a <= 'j'  = go as t4
                                                                     | a <= 'm'  = go as t5
                                                                     | a == 'n'  = go as t6
                                                                     | a <= 'q'  = go as t7
                                                                     | a == 'r'  = go as t8
                                                                     | a == 's'  = go as t9
                                                                     | a == 't'  = go as t10
                                                                     | a == 'u'  = go as t11
                                                                     | otherwise = go as t12

        go _ _ = False

Results

The results are presented in the following reports and summary table:

List

Set

WordTree 3

WordTree 42

WordTree 43

WordTree 44

WordTree 45

WordTree 46

WordTree 47

WordTree 48

WordTree 49

WordTree 410

WordTree 411

WordTree 412

sec

100,11463

9,08085

6,80515

7,25959

5,98261

5,84642

5,70404

5,69355

5,77493

5,95812

5,85358

5,90960

5,97161

6,05108

xfaster/List

/

11,0248

14,7116

13,7907

16,7343

17,1241

17,5515

17,5839

17,3361

16,8030

17,1031

16,9410

16,7651

16,5449

%lower/List

/

-90,93 %

-93,20 %

-92,75 %

-94,02 %

-94,16 %

-94,30 %

-94,31 %

-94,23 %

-94,05 %

-94,15 %

-94,10 %

-94,04 %

-93,96 %

xfaster/Set

/

1,00

1,3344

1,25

1,52

1,55

1,59

1,59

1,57

1,52

1,55

1,54

1,52

1,50

%lower/Set

/

/

-10,00 %

-20,06 %

-34,12 %

-35,62 %

-37,19 %

-37,30 %

-36,41 %

-34,39 %

-35,54 %

-34,92 %

-34,24 %

-33,36 %

xfaster/WordTree3

/

/

1,00

0,94

1,14

1,16

1,19

1,20

1,18

1,14

1,16

1,15

1,14

1,12

%lower/WordTree3

/

/

/

6,68 %

-12,09 %

-14,09 %

-16,18 %

-16,33 %

-15,14 %

-12,45 %

-13,98 %

-13,16 %

-12,25 %

-11,08 %

!!! File report-Voyage-full2_p1.[1131x1600@256d].png not found !!!!!! File report-global2_p1.[1131x1600@256d].png not found !!!

Conlusion

It can be seen that the processing speed is the lowest with 5 or 6 branches in the tree. The execution speed is then:

  • About 17 times faster than processing with a list.

  • 1.59 times faster than the Set.

  • 1.17/1.19 times faster than the WordTree 3 structure.

This structure is therefore the most optimized one I could find for word filtering.

I want to remind that this structure, the number of branches and the distribution of letters in these branches is only valid for the word list to be filtered. A different list, in another language, will most certainly give different results and will require readjustments to keep the performance optimal.

And remember also that for most purposes the Set structure provided by the container module of Haskell will fit perfectly, will be more general than the structure presented here and will save development time.